Enhancing Scalability of Sparse Direct Methods

Authors

  • X. S. Li
  • J. Demmel
  • L. Grigori
  • M. Gu
  • J. Xia
  • S. Jardin
  • C. Sovinec
  • L.-Q. Lee
Abstract

TOPS is providing high-performance, scalable sparse direct solvers, which have had a significant impact on SciDAC applications, including fusion simulation (CEMM), accelerator modeling (COMPASS), and many other mission-critical applications in DOE and elsewhere. Our recent developments have focused on new techniques to overcome the scalability bottlenecks of direct methods, in both time and memory. These include parallelizing the symbolic analysis phase and developing linear-complexity sparse factorization methods. The new techniques will make sparse direct methods more widely usable in large 3D simulations on highly parallel petascale computers.

1. The SciDAC applications

1.1. Fusion energy research

The Center for Extended Magnetohydrodynamic Modeling (CEMM) [1] is developing simulation codes for studying the nonlinear macroscopic dynamics of MHD-like phenomena in fusion plasmas and for addressing critical issues facing burning plasma experiments such as ITER. Their code suite includes M3D-C1 and NIMROD. The PDEs include many more physical effects than the standard MHD equations; they involve large, multiple time scales and are very stiff temporally, therefore requiring implicit methods. The linear systems are extremely ill-conditioned, and 50-90% of the execution time is spent in linear solvers. The TOPS sparse direct solver SuperLU has played a significant role in both codes. For the large 3D matrix-free formulation in NIMROD, SuperLU is also used as an effective preconditioner for the global GMRES solver.

1.2. Accelerator design

The Community Petascale Project for Accelerator Science and Simulation (COMPASS) [2] is developing parallel simulation tools with integrated capabilities in beam dynamics, electromagnetics, and advanced accelerator concept modeling for accelerator design, analysis, and discovery. Their code suite includes Omega3P eigenanalysis.
The eigen computations for cavity mode frequencies and field vectors are on the critical path of the shape optimization cycles. These need to be done repeatedly, accurately, and quickly. Another challenge is that a large number of small, tightly clustered nonzero eigenvalues are desired. The matrix dimensions can be tens to hundreds of millions. The main method used is shift-invert Lanczos, for which the shifted linear systems are solved with a combination of direct and iterative methods.

SciDAC 2007, IOP Publishing. Journal of Physics: Conference Series 78 (2007) 012041. doi:10.1088/1742-6596/78/1/012041. © 2007 IOP Publishing Ltd.

1.3. SuperLU efficiency with these applications

SuperLU [6] is a leading scalable solver for sparse linear systems using direct methods, whose development is mainly funded through the TOPS SciDAC project (led by David Keyes) [7]. Table 1 shows the characteristics of a few typical matrices taken from these simulation codes.

Name        Codes    Type     Order (N)  nnz(A)/N  Fill-ratio
matrix181   M3D-C1   Real     589,698    161       9.3
matrix211   M3D-C1   Real     801,378    161       9.3
cc_linear2  NIMROD   Complex  259,203    109       7.5
dds15       Omega3P  Real     834,575    16        40.2

Table 1. Characteristics of the sample matrices. The sparsity is measured as the average number of nonzeros per row (i.e., nnz(A)/N), and the Fill-ratio is the ratio of the number of nonzeros in L+U to that in A. Here, MeTiS is used to reorder the equations to reduce fill.

Figure 1. SuperLU runtime (seconds) for the linear systems from the SciDAC applications: (a) factorization time and (b) triangular solution time, each plotted against the number of IBM Power5 processors (1 to 256) for matrix181, matrix211, cc_linear2, and dds15. This was done on the IBM Power 5 machine at NERSC. The factorization reached a flop rate of 161 Gflops/s for matrix211.
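The two quantities reported in Table 1, nnz(A)/N and the fill ratio nnz(L+U)/nnz(A), can be illustrated on a toy pattern. The following is only a minimal sketch under stated assumptions: it is not SuperLU's algorithm, it eliminates in the given order with no pivoting, and it ignores numerical cancellation. The helper names (`symbolic_lu`, `fill_ratio`, `arrow`) and the 5 x 5 "arrow" pattern are hypothetical, chosen for illustration; the two orderings are picked by hand rather than computed by MeTiS.

```python
def symbolic_lu(pattern, n):
    """Return the nonzero pattern of L+U for an n x n matrix.

    `pattern` is a set of (row, col) pairs. Elimination runs in the given
    order with no pivoting, and numerical cancellation is ignored.
    """
    rows = [set() for _ in range(n)]  # column indices present in each row
    cols = [set() for _ in range(n)]  # row indices present in each column
    for i, j in pattern:
        rows[i].add(j)
        cols[j].add(i)
    for k in range(n):
        below = [i for i in cols[k] if i > k]  # nonzeros of L in column k
        right = [j for j in rows[k] if j > k]  # nonzeros of U in row k
        # Eliminating pivot k couples every row in `below` with every
        # column in `right`; any missing entry becomes fill.
        for i in below:
            for j in right:
                rows[i].add(j)
                cols[j].add(i)
    return {(i, j) for i in range(n) for j in rows[i]}


def fill_ratio(pattern, n):
    """nnz(L+U) / nnz(A), the Fill-ratio definition used in Table 1."""
    return len(symbolic_lu(pattern, n)) / len(pattern)


def arrow(n, hub):
    """Full diagonal plus one dense row and column at index `hub`."""
    p = {(i, i) for i in range(n)}
    p |= {(hub, j) for j in range(n)} | {(i, hub) for i in range(n)}
    return p


n = 5
hub_first = arrow(n, 0)      # dense row/column eliminated first
hub_last = arrow(n, n - 1)   # same matrix, dense row/column ordered last

print("nnz(A)/N            :", len(hub_first) / n)
print("fill ratio, hub first:", fill_ratio(hub_first, n))
print("fill ratio, hub last :", fill_ratio(hub_last, n))
```

In this toy case, eliminating the dense row and column first fills the whole matrix, while ordering it last produces no fill at all. This is the effect that a fill-reducing reordering aims for on the much larger matrices in Table 1.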
Figure 1 shows the parallel runtime of the two important phases of SuperLU: factorization and triangular solution. The experiments were performed on an IBM Power 5 parallel machine at NERSC. In the strong-scaling sense, the factorization routine scales very well, although performance varies across applications. The triangular solution takes a very small fraction of the total time. On the other hand, it does not scale as well as factorization, mainly because of its large communication-to-computation ratio and higher degree of sequential dependencies. One of our future tasks is to improve the scalability of this phase, since in these application codes the triangular solution often needs to be performed several times for each factorization.

Over the last year or so, we have focused on developing new algorithms to enhance the scalability of our direct solvers. The new results are summarized in the next two sections.

2. Improving memory scalability of SuperLU – parallelizing symbolic factorization

Symbolic factorization is the phase that determines the nonzero locations of the L and U factors. In most parallel sparse direct solvers, this phase is performed in serial, with matrix A being available on...


Similar resources

Domain Decomposition Based High Performance Parallel Computing

The study deals with the parallelization of finite element based Navier-Stokes codes using domain decomposition and state-of-the-art sparse direct solvers. There has been significant improvement in the performance of sparse direct solvers. Parallel sparse direct solvers are not found to exhibit good scalability. Hence, the parallelization of sparse direct solvers is done using domain decomposition t...


Enhancing Parallelism in Monte Carlo Techniques for Solving Large Sparse Linear Systems

The problem of solving large scale sparse linear systems arises in many scientific and engineering applications. Recent advances in multicore processors and clusters that consist of hundreds of thousands of cores motivate new techniques to solve such problems efficiently. Two main design considerations are the accuracy of the solution and the scalability of the method, which compete against eac...


A New IRIS Segmentation Method Based on Sparse Representation

Iris recognition is one of the most reliable methods for identification. In general, it consists of image acquisition, iris segmentation, feature extraction and matching. Among them, iris segmentation has an important role in the performance of any iris recognition system. Nonlinear eye movement, occlusion, and specular reflection are the main challenges for any iris segmentation method. In thi...


An Efficient Solver for Sparse Linear Systems Based on Rank-Structured Cholesky Factorization

Direct factorization methods for the solution of large, sparse linear systems that arise from PDE discretizations are robust, but typically show poor time and memory scalability for large systems. In this paper, we describe an efficient sparse, rank-structured Cholesky algorithm for solution of the positive definite linear system Ax = b when A comes from a discretized partial-differential equat...


PSPIKE: A Parallel Hybrid Sparse Linear System Solver

The availability of large-scale computing platforms comprised of tens of thousands of multicore processors motivates the need for the next generation of highly scalable sparse linear system solvers. These solvers must optimize parallel performance, processor (serial) performance, as well as memory requirements, while being robust across broad classes of applications and systems. In this paper, ...



Publication date: 2007